An Implementation of a Multilingual Regular Expression Segmentor for Ordinary and Morphologically Rich Lexical Tokens

Authors

  • Paul Horng Jyh Wu
  • Kevin Cheong
Abstract

Lexical pattern matching and text extraction are essential components of many Natural Language Processing applications. Following the language hierarchy first conceived by Chomsky, it is commonly accepted that simple phrasal patterns should be categorised under the class of Regular Languages (RL). There are three operations in RL: Union, Concatenation and Kleene Closure, which are applied to a finite lexicon. The machinery that recognises RL is the Finite State Machine (FSM). This paper discusses and postulates that the degree to which a class of patterns exercises the RL operators is directly proportional to the morphological richness of its lexical tokens.
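The three RL operations can be illustrated with an ordinary regular-expression engine. The sketch below is a hypothetical toy, not the paper's segmentor: it builds a token pattern for morphologically simple words from a union over a small lexicon of stems, concatenation with an optional prefix, and a Kleene closure over suffixes.

```python
import re

# The three RL operations, written as Python regex fragments
# (a hypothetical illustration, not the paper's implementation):
#   Union:          a|b   -- match either alternative
#   Concatenation:  ab    -- match one after the other
#   Kleene closure: a*    -- match zero or more repetitions

prefix = r"(?:un|re)?"        # union of prefixes, made optional
stem = r"(?:do|play|read)"    # union over a small finite lexicon
suffix = r"(?:ing|ed|s)*"     # Kleene closure over suffixes
token = re.compile(prefix + stem + suffix)  # concatenation

for word in ["replaying", "undo", "reads", "xyz"]:
    print(word, "->", bool(token.fullmatch(word)))
```

Richer morphologies would need longer unions and deeper nesting of closures, which is the proportionality the abstract postulates.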


Similar resources

Prolog(elex): a New Tool to Generate Prolog Tokenizers

This paper presents a tool called Elex(Prolog) to construct tokenizers (lexical analysers, scanners, lexers) in Prolog. It is based on Elex, a multilingual scanner generator by Matthew Phillips. The paper motivates the tool, and presents its functionality and implementation. It also compares Elex(Prolog) to the only alternative Prolog scanner generator that we are aware of: plex. ...


Universal Joint Morph-Syntactic Processing: The Open University of Israel's Submission to The CoNLL 2017 Shared Task

We present the Open University’s submission (ID OpenU-NLP-Lab) to the CoNLL 2017 UD Shared Task on multilingual parsing from raw text to Universal Dependencies. The core of our system is a joint morphological disambiguator and syntactic parser which accepts morphologically analyzed surface tokens as input and returns morphologically disambiguated dependency trees as output. Our parser requires ...


Table-driven look-ahead lexical analysis

Modern programming languages use regular expressions to define valid tokens. Traditional lexical analyzers based on minimum deterministic finite automata for regular expressions cannot handle the look-ahead problem. The scanner writer needs to explicitly identify the look-ahead states and code the buffering and re-scanning operations by hand. We identify the class of finite look-ahead finite au...


That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrime...


Mealy Machines are a Better Model of Lexical Analyzers

Lexical analyzers partition input characters into tokens. When ambiguities arise during lexical analysis, the longest-match rule is generally adopted to resolve the ambiguities. The longest-match rule causes the look-ahead problem in traditional lexical analyzers, which are based on Moore machines. In Moore machines, output tokens are associated with states of the automata. By contrast...
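The look-ahead problem under the longest-match rule can be sketched in a few lines. This is a hypothetical illustration, not the cited paper's construction: a Moore-style scanner only knows a token was matched at an accepting state, so it must remember the last accepting position while reading ahead, and back up to it when the scan fails.

```python
# Minimal longest-match scan (hypothetical sketch). `accepts` stands
# in for the accepting states of a Moore-machine tokenizer: a set of
# strings that are valid tokens.

def longest_match(text, start, accepts):
    """Read forward from `start`, recording the last position where
    text[start:pos] was a valid token; the caller re-scans (backs up)
    to that position -- the look-ahead/buffering problem."""
    last_accept = -1
    pos = start
    while pos <= len(text):
        if text[start:pos] in accepts:
            last_accept = pos  # remember the last accepting position
        pos += 1               # keep reading ahead past it
    return last_accept

TOKENS = {"do", "dot", "dote"}
# Scanning "dotes" over-reads past "do" and "dot" before settling on
# the longest token "dote", then backs up to position 4.
print(longest_match("dotes", 0, TOKENS))
```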




Publication date: 1995